Abstract. This paper proposes a new method for training stable extreme
learning machines (ELMs). The new method, called StaELM, uses correlation
coefficients in a Gaussian process to measure the similarities between
different hidden layer outputs. Unlike kernel operations such as linear or
RBF kernels, correlation coefficients quantify the similarity of hidden layer
outputs with real numbers in (0, 1] and prevent the covariance matrix in the
Gaussian process from becoming singular. Training through a Gaussian process
yields ELM models that are insensitive to random initialization and can avoid
over-fitting. We analyse the rationality of StaELM and show that existing
kernel-based ELMs are special cases of StaELM. We used real-world datasets to
train both regression and classification StaELM models. The experimental
results show that StaELM models achieved higher accuracies in both regression
and classification than traditional kernel-based ELMs. The StaELM models are
also more stable with respect to different random initializations, over-fit
less, and are faster to train.

1 Introduction

Extreme learning machine (ELM) is a special single-hidden layer feed-forward
neural network (SLFN) [6]. Due to its lower computational complexity and
better generalization performance, ELM has recently attracted a lot of
interest in research and industry and is used in a wide range of applications
[5]. ELM determines the input weights and hidden layer biases randomly and
computes the output weights analytically. Therefore, it is extremely fast to
train an ELM model. It has also been proved that ELM has the universal
approximation capability [3].

Currently, ELM has two main problems in practical applications. The first
problem is that the trained ELM model is sensitive to the random initial
settings [15]. Different initial settings often result in different
performances, which implies that the training process produces unstable ELM
models. The second problem is over-fitting [3], which is usually caused by the
numerous hidden layer nodes specified to best approximate the training data
set. A number of improvements have been proposed to tackle these problems. One
approach is to optimize the random weights with different evolutionary
algorithms. Examples include E-ELM [15], SaE-ELM [1], and O-ELM [9]. Another
approach is to select better architectures for ELM, for instance, I-ELMs
[3,4], OP-ELM [10], and localized generalization error ELM [13]. Although the
literature reports better performance for these improved ELM models, their
higher computational complexity makes them impractical for regression and
classification tasks with a large number of training instances.

A different direction for improving ELM without increasing the computational
complexity is to estimate the prior probability distribution of ELM models.
Soria-Olivas et al. [12] designed a Bayesian ELM (BELM). Luo et al. [8,14]
proposed sparse Bayesian ELM (SBELM).¹ Chatzis et al. [2] proposed the
one-hidden-layer nonparametric Bayesian kernel machine (1HNBKM). Because BELM
and 1HNBKM use linear and RBF kernels to handle the hidden layer outputs, we
call them kernel-based ELMs in this paper. The empirical analysis shows that
kernel-based ELMs are still sensitive to random initialization. For example,
there is an obvious difference between the predictive results of BELM and
1HNBKM on the Libras Movement dataset² with random input weights in the
intervals [0, 1] and [-1, 1]. In addition, over-fitting still exists for
1HNBKM.

In this paper, we propose to use a Gaussian process to train ELM models and
present a stable extreme learning machine algorithm, StaELM. In this
algorithm, we use correlation coefficients in the Gaussian process to measure
the similarity between different hidden layer outputs with real numbers in
(0, 1]. The advantages of using a Gaussian process for ELM training over the
aforementioned training methods are that the training process is fast and the
trained ELM models are insensitive to random initialization and can avoid
over-fitting. In the training process, the correlation coefficients prevent
the covariance matrix from becoming singular, so that its inverse can be
computed.

We have used 12 UCI and KEEL³ datasets to conduct the experiments and compared
the accuracy and running time of StaELM, the original ELM, BELM, and 1HNBKM.
The experimental results show that StaELM models achieved higher accuracies
and lower running time in both regression and classification than the other
methods. The StaELM models are stable with respect to different random
initializations and over-fit less.

¹ The main difference between SBELM and BELM is that the independent
regularization priors in SBELM are imposed on each weight instead of one
shared prior for all weights in BELM. Because SBELM and BELM are homologous,
we only discuss and analyse BELM in this paper due to its simplicity.

² http://archive.ics.uci.edu/ml/

³ http://sci2s.ugr.es/keel/datasets.php

The rest of this paper is organized as follows. In Section 2, we briefly
summarize kernel-based ELMs. Section 3 introduces our proposed StaELM.
Experimental simulations are presented in Section 4. Finally, we conclude this
paper in Section 5.


2  Kernel-based  ELMs 

In this section, we review three existing ELM models, i.e., the original ELM,
BELM, and 1HNBKM. Because the first two use linear kernels to handle the
hidden layer outputs whereas the last one uses the RBF kernel for that
purpose, we call them kernel-based ELMs.


2.1  ELM 

ELM [6] is a single-hidden layer feed-forward neural network (SLFN) that does
not require any iterative optimization of the input/output weights. Consider
the training dataset $(\mathbf{X}_{N\times D}, \mathbf{Y}_{N\times C})$ and the
testing dataset $(\mathbf{X}'_{M\times D}, \mathbf{Y}'_{M\times C})$, where $N$
is the number of training instances, $D$ is the number of input variables, $M$
is the number of testing instances, and $C$ is the number of output variables.
Usually, $\mathbf{Y}'_{M\times C}$ is unknown and needs to be predicted. ELM
determines $\mathbf{Y}'_{M\times C}$ as follows:


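$$\mathbf{Y}'_{M\times C} = \mathbf{H}'_{M\times L}\boldsymbol{\beta}_{L\times C} =
\begin{cases}
\mathbf{H}'\left(\mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}\mathbf{Y}, & \text{if } N > L \\
\mathbf{H}'\mathbf{H}^{T}\left(\mathbf{H}\mathbf{H}^{T}\right)^{-1}\mathbf{Y}, & \text{if } N < L
\end{cases}, \tag{3}$$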


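where $\boldsymbol{\beta}_{L\times C}$ is the output weight matrix, $L$ is the
number of hidden layer nodes,

$$\mathbf{H}_{N\times L} =
\begin{bmatrix} \mathbf{h}(\mathbf{x}_1) \\ \vdots \\ \mathbf{h}(\mathbf{x}_N) \end{bmatrix} =
\begin{bmatrix}
g(\mathbf{w}_1\cdot\mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}_1 + b_L) \\
\vdots & \ddots & \vdots \\
g(\mathbf{w}_1\cdot\mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}_N + b_L)
\end{bmatrix} \tag{4}$$

is the hidden layer output matrix for training instances,

$$\mathbf{H}'_{M\times L} =
\begin{bmatrix} \mathbf{h}(\mathbf{x}'_1) \\ \vdots \\ \mathbf{h}(\mathbf{x}'_M) \end{bmatrix} =
\begin{bmatrix}
g(\mathbf{w}_1\cdot\mathbf{x}'_1 + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}'_1 + b_L) \\
\vdots & \ddots & \vdots \\
g(\mathbf{w}_1\cdot\mathbf{x}'_M + b_1) & \cdots & g(\mathbf{w}_L\cdot\mathbf{x}'_M + b_L)
\end{bmatrix} \tag{5}$$

is the hidden layer output matrix for testing instances,
$g(z) = \frac{1}{1+e^{-z}}$ is the sigmoid activation function,

$$\mathbf{W}_{D\times L} = \begin{bmatrix}\mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_L\end{bmatrix} =
\begin{bmatrix}
w_{11} & w_{21} & \cdots & w_{L1} \\
w_{12} & w_{22} & \cdots & w_{L2} \\
\vdots & \vdots & \ddots & \vdots \\
w_{1D} & w_{2D} & \cdots & w_{LD}
\end{bmatrix}
\quad\text{and}\quad
\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_L \end{bmatrix} \tag{6}$$

are the input weights and hidden layer biases, which are randomly determined.
ELM is sensitive to random initialization and shows obvious over-fitting. In
order to tackle these problems, two improvements, i.e., BELM and 1HNBKM, are
discussed below.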
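
The two steps of ELM training described above, a random hidden layer followed
by an analytic least-squares solve as in Eq. (3), can be sketched in plain
Python. This is a minimal illustration rather than the authors'
implementation: the helper names (`hidden`, `elm_fit`, `elm_predict`,
`solve`), the toy data, and the choice to send the case N = L through the
second branch are assumptions made here.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def hidden(X, W, b):
    """Row i is h(x_i) = [g(w_1 . x_i + b_1), ..., g(w_L . x_i + b_L)]."""
    return [[sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bl)
             for w, bl in zip(W, b)] for x in X]

def elm_fit(X, Y, L, rng):
    """Draw random input weights and biases, then solve for beta analytically."""
    D = len(X[0])
    W = [[rng.uniform(0.0, 1.0) for _ in range(D)] for _ in range(L)]  # L vectors w_l
    b = [rng.uniform(0.0, 1.0) for _ in range(L)]
    H = hidden(X, W, b)
    Ht = transpose(H)
    if len(X) > L:   # beta = (H^T H)^{-1} H^T Y
        beta = solve(matmul(Ht, H), matmul(Ht, Y))
    else:            # beta = H^T (H H^T)^{-1} Y  (N = L handled here by assumption)
        beta = matmul(Ht, solve(matmul(H, Ht), Y))
    return W, b, beta

def elm_predict(X, W, b, beta):
    return matmul(hidden(X, W, b), beta)

rng = random.Random(0)
X = [[rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)] for _ in range(30)]
Y = [[x1 * x1 + x2] for x1, x2 in X]          # toy regression target
W, b, beta = elm_fit(X, Y, L=3, rng=rng)
Y_hat = elm_predict(X, W, b, beta)
train_rmse = math.sqrt(sum((p[0] - y[0]) ** 2 for p, y in zip(Y_hat, Y)) / len(Y))
```

Calling `elm_fit` again with a differently seeded generator gives different
`W` and `b`, and therefore a different `beta`, which is the sensitivity to
random initialization discussed in this section.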

2.2  BELM 

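BELM [12] optimizes the output weights $\boldsymbol{\beta}$ by using Bayesian
linear regression as follows:

$$y = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} + \varepsilon, \tag{7}$$

where $\varepsilon \sim N(0, \sigma_N^2)$ and
$\boldsymbol{\beta} \sim N(\mathbf{0}, \alpha^{-1}\mathbf{I}_{L\times L})$. The
posterior distribution over the output weights $\boldsymbol{\beta}$ is
expressed as

$$P(\boldsymbol{\beta} \mid \mathbf{H}, \mathbf{Y}) = N(\hat{\boldsymbol{\beta}}, \mathbf{S}), \tag{8}$$

where $\hat{\boldsymbol{\beta}} = \sigma_N^{-2}\mathbf{S}\mathbf{H}^{T}\mathbf{Y}$
and $\mathbf{S} = \left(\alpha\mathbf{I} + \sigma_N^{-2}\mathbf{H}^{T}\mathbf{H}\right)^{-1}$
are the mean and covariance matrix respectively. For a new instance
$\mathbf{x}' = (x'_1, x'_2, \cdots, x'_D)$, the output $y'$ predicted with BELM
obeys the Gaussian distribution $N(\mu, \sigma^2)$, where

$$\mu = \mathbf{h}(\mathbf{x}')\hat{\boldsymbol{\beta}}, \tag{9}$$

$$\sigma^2 = \sigma_N^2 + \mathbf{h}(\mathbf{x}')\mathbf{S}\mathbf{h}^{T}(\mathbf{x}'). \tag{10}$$

In BELM, $\mu$ is taken as the prediction for the new instance $\mathbf{x}'$,
i.e., $y' = \mu$, and $\sigma^2$ is the variance used to determine the
confidence interval of the prediction $y'$. Two parameters need to be
determined in BELM: $\sigma_N^2$ and $\alpha > 0$. BELM effectively controls
over-fitting, but its sensitivity to randomly initialized weights remains
unsolved.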
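
The posterior and predictive quantities of BELM reduce to regularized linear
solves. The plain-Python sketch below evaluates the posterior mean, the
predictive mean, and the predictive variance on a tiny hand-made hidden layer
output matrix with a single output column. The function names and the numbers
are illustrative assumptions and are not taken from BELM [12].

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def belm_posterior(H, Y, alpha, sigma2):
    """Return (beta_hat, S_inv), where S_inv = alpha*I + H^T H / sigma2 is the inverse of S."""
    Ht = transpose(H)
    L = len(Ht)
    HtH = matmul(Ht, H)
    S_inv = [[HtH[i][j] / sigma2 + (alpha if i == j else 0.0) for j in range(L)]
             for i in range(L)]
    rhs = [[v / sigma2 for v in row] for row in matmul(Ht, Y)]
    beta_hat = solve(S_inv, rhs)               # beta_hat = sigma2^{-1} S H^T Y
    return beta_hat, S_inv

def belm_predict(h, beta_hat, S_inv, sigma2):
    """Predictive mean h beta_hat and variance sigma2 + h S h^T for one row h."""
    mu = sum(hi * bi[0] for hi, bi in zip(h, beta_hat))
    S_ht = solve(S_inv, [[hi] for hi in h])    # S h^T without forming S explicitly
    var = sigma2 + sum(hi * s[0] for hi, s in zip(h, S_ht))
    return mu, var

H = [[1.0], [1.0]]      # N = 2 instances, L = 1 hidden node
Y = [[1.0], [3.0]]
beta_hat, S_inv = belm_posterior(H, Y, alpha=1.0, sigma2=1.0)
mu, var = belm_predict([1.0], beta_hat, S_inv, sigma2=1.0)
# Here S = 1/3, so beta_hat = 4/3, mu = 4/3 and var = 1 + 1/3.
```

Solving against the inverse of S avoids forming S itself, which is a design
choice of this sketch and not something the text prescribes.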

2.3 1HNBKM


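Given a training dataset $(\mathbf{X}_{N\times D}, \mathbf{Y}_{N\times C})$,
1HNBKM [2] predicts the output $y'$ for a new instance $\mathbf{x}'$ via the
following joint probability distribution

$$\begin{bmatrix} \mathbf{Y} \\ y' \end{bmatrix} \sim N\left(\mathbf{0},
\begin{bmatrix}
\mathbf{K}(\mathbf{H},\mathbf{H}) + \sigma_N^2\mathbf{I} & \mathbf{k}^{T}(\mathbf{h}(\mathbf{x}'),\mathbf{H}) \\
\mathbf{k}(\mathbf{h}(\mathbf{x}'),\mathbf{H}) & k(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}'))
\end{bmatrix}\right), \tag{11}$$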


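where the meaning of $\sigma_N^2$ is the same as in BELM,

$$\mathbf{K}(\mathbf{H},\mathbf{H})_{N\times N} =
\begin{bmatrix}
k(\mathbf{h}(\mathbf{x}_1),\mathbf{h}(\mathbf{x}_1)) & k(\mathbf{h}(\mathbf{x}_1),\mathbf{h}(\mathbf{x}_2)) & \cdots & k(\mathbf{h}(\mathbf{x}_1),\mathbf{h}(\mathbf{x}_N)) \\
k(\mathbf{h}(\mathbf{x}_2),\mathbf{h}(\mathbf{x}_1)) & k(\mathbf{h}(\mathbf{x}_2),\mathbf{h}(\mathbf{x}_2)) & \cdots & k(\mathbf{h}(\mathbf{x}_2),\mathbf{h}(\mathbf{x}_N)) \\
\vdots & \vdots & \ddots & \vdots \\
k(\mathbf{h}(\mathbf{x}_N),\mathbf{h}(\mathbf{x}_1)) & k(\mathbf{h}(\mathbf{x}_N),\mathbf{h}(\mathbf{x}_2)) & \cdots & k(\mathbf{h}(\mathbf{x}_N),\mathbf{h}(\mathbf{x}_N))
\end{bmatrix} \tag{12}$$

is a kernel matrix,

$$\mathbf{k}(\mathbf{h}(\mathbf{x}'),\mathbf{H}) =
\begin{bmatrix} k(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}_1)) & k(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}_2)) & \cdots & k(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}_N)) \end{bmatrix} \tag{13}$$

is a kernel vector, and

$$k(\mathbf{u},\mathbf{v}) = \exp\left(-\frac{\|\mathbf{u}-\mathbf{v}\|^2}{2\lambda^2}\right) \tag{14}$$

is the RBF kernel function. Note that $k(\mathbf{u},\mathbf{v}) = 1$ when
$\mathbf{u} = \mathbf{v}$.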

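Then, the posterior distribution of the predicted output $y'$ is

$$P(y' \mid \mathbf{h}(\mathbf{x}'), \mathbf{H}, \mathbf{Y}) = N(\mu, \sigma^2), \tag{15}$$

where

$$\mu = \mathbf{k}(\mathbf{h}(\mathbf{x}'),\mathbf{H})\left(\mathbf{K}(\mathbf{H},\mathbf{H}) + \sigma_N^2\mathbf{I}\right)^{-1}\mathbf{Y}, \tag{16}$$

$$\sigma^2 = k(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}')) - \mathbf{k}(\mathbf{h}(\mathbf{x}'),\mathbf{H})\left(\mathbf{K}(\mathbf{H},\mathbf{H}) + \sigma_N^2\mathbf{I}\right)^{-1}\mathbf{k}^{T}(\mathbf{h}(\mathbf{x}'),\mathbf{H}). \tag{17}$$

Similarly, $\mu$ is the prediction for $\mathbf{x}'$ and $\sigma^2$ is the
variance of the prediction. The parameters $\sigma_N^2$ and $\lambda^2$ are
unknown and need to be determined. 1HNBKM suffers severe over-fitting due to
the use of the RBF kernel and is also sensitive to random initialization.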
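
The posterior mean and variance of Eqs. (16) and (17) can be evaluated with
two linear solves against the noise-augmented kernel matrix. The sketch below
does this in plain Python for the RBF kernel of Eq. (14) on two hand-picked
hidden layer outputs. The function names and the data are hypothetical, and
the similarity function is passed as an argument so that other functions can
be swapped in.

```python
import math

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def rbf(u, v, lam2=1.0):
    """RBF kernel of Eq. (14): exp(-||u - v||^2 / (2 * lam2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * lam2))

def gp_predict(h_new, H, y, sim, sigma2):
    """Posterior mean (Eq. 16) and variance (Eq. 17) for one output column y."""
    N = len(H)
    # A = K(H, H) + sigma2 * I
    A = [[sim(H[i], H[j]) + (sigma2 if i == j else 0.0) for j in range(N)]
         for i in range(N)]
    k = [sim(h_new, H[i]) for i in range(N)]
    A_inv_y = solve(A, [[v] for v in y])
    A_inv_k = solve(A, [[v] for v in k])
    mu = sum(ki * a[0] for ki, a in zip(k, A_inv_y))
    var = sim(h_new, h_new) - sum(ki * a[0] for ki, a in zip(k, A_inv_k))
    return mu, var

H = [[0.0], [1.0]]       # hidden layer outputs of two training instances (L = 1)
y = [1.0, 2.0]
mu, var = gp_predict([0.0], H, y, rbf, sigma2=0.001)
```

With such a small noise variance the prediction at a training point stays
very close to its training target, which illustrates why a near-identity
kernel matrix leads to over-fitting.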


3  Gaussian  Process-based  Stable  ELM 

In this section, we describe our proposed StaELM, which conducts inference
based on a Gaussian process and uses the correlation coefficient to construct
the covariance matrix. StaELM also predicts the output $y'$ for a new instance
$\mathbf{x}'$ based on Eq. (15), where

$$\mu = \mathbf{q}(\mathbf{h}(\mathbf{x}'),\mathbf{H})\left(\mathbf{Q}(\mathbf{H},\mathbf{H}) + \sigma_N^2\mathbf{I}\right)^{-1}\mathbf{Y}, \tag{18}$$

Fig. 1. Comparison of Matlab computational time between the kernel matrix in
1HNBKM and the correlation matrix in StaELM


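$$\sigma^2 = \sigma_N^2 + q(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}')) - \mathbf{q}(\mathbf{h}(\mathbf{x}'),\mathbf{H})\left(\mathbf{Q}(\mathbf{H},\mathbf{H}) + \sigma_N^2\mathbf{I}\right)^{-1}\mathbf{q}^{T}(\mathbf{h}(\mathbf{x}'),\mathbf{H}), \tag{19}$$

$$\mathbf{Q}(\mathbf{H},\mathbf{H})_{N\times N} =
\begin{bmatrix}
q(\mathbf{h}(\mathbf{x}_1),\mathbf{h}(\mathbf{x}_1)) & q(\mathbf{h}(\mathbf{x}_1),\mathbf{h}(\mathbf{x}_2)) & \cdots & q(\mathbf{h}(\mathbf{x}_1),\mathbf{h}(\mathbf{x}_N)) \\
q(\mathbf{h}(\mathbf{x}_2),\mathbf{h}(\mathbf{x}_1)) & q(\mathbf{h}(\mathbf{x}_2),\mathbf{h}(\mathbf{x}_2)) & \cdots & q(\mathbf{h}(\mathbf{x}_2),\mathbf{h}(\mathbf{x}_N)) \\
\vdots & \vdots & \ddots & \vdots \\
q(\mathbf{h}(\mathbf{x}_N),\mathbf{h}(\mathbf{x}_1)) & q(\mathbf{h}(\mathbf{x}_N),\mathbf{h}(\mathbf{x}_2)) & \cdots & q(\mathbf{h}(\mathbf{x}_N),\mathbf{h}(\mathbf{x}_N))
\end{bmatrix} \tag{20}$$

is a correlation matrix,

$$\mathbf{q}(\mathbf{h}(\mathbf{x}'),\mathbf{H}) =
\begin{bmatrix} q(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}_1)) & q(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}_2)) & \cdots & q(\mathbf{h}(\mathbf{x}'),\mathbf{h}(\mathbf{x}_N)) \end{bmatrix} \tag{21}$$

is a correlation vector, $q(\mathbf{u},\mathbf{v})$ is the correlation
function of Eq. (22), and

$$\frac{\mathrm{Cov}(\mathbf{u},\mathbf{v})}{\sqrt{D(\mathbf{u})}\sqrt{D(\mathbf{v})}} \tag{23}$$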


is the correlation coefficient, which measures the strength and direction of
the linear relationship between two variables $\mathbf{u}$ and $\mathbf{v}$.
$\mathrm{Cov}(\mathbf{u},\mathbf{v})$ is the covariance of $\mathbf{u}$ and
$\mathbf{v}$, and $D(\mathbf{u})$ and $D(\mathbf{v})$ are the variances of
$\mathbf{u}$ and $\mathbf{v}$ respectively. Note that Eq. (22) normalizes the
correlation coefficient into the interval (0, 1]. Other normalizations are
also allowable. The inference process of StaELM is similar to that of 1HNBKM.
The main difference is that StaELM measures the similarity between two hidden
layer outputs with the correlation function in Eq. (22) instead of the RBF
kernel used in 1HNBKM. The advantages of using the correlation function are
summarized as follows. (1) The correlation coefficient evaluates the
relationship between two different hidden layer outputs
$\mathbf{h}(\mathbf{u})$ and $\mathbf{h}(\mathbf{v})$ with a probabilistic
approach. This makes StaELM consider the inherent prior knowledge of the
training dataset more directly and comprehensively than the kernel function
based 1HNBKM. (2) The correlation coefficient reduces the chance of
over-fitting of StaELM. For 1HNBKM with $L$ hidden layer nodes, the prediction
$\hat{\mathbf{Y}}$ for the training dataset is

$$\hat{\mathbf{Y}} = \mathbf{K}(\mathbf{H},\mathbf{H})\left(\mathbf{K}(\mathbf{H},\mathbf{H}) + \sigma_N^2\mathbf{I}\right)^{-1}\mathbf{Y}. \tag{24}$$

Fig. 2. Curves for training and testing accuracies of different ELMs changing
with L on the Concrete Compressive Strength dataset: (a) training accuracy;
(b) testing accuracy. StaELM does not over-fit and obtains higher testing
accuracies with fewer hidden nodes.


With the increase of $L$, $k(\mathbf{u},\mathbf{v}) \to 0$ when
$\mathbf{u} \neq \mathbf{v}$. This leads to
$\mathbf{K}(\mathbf{H},\mathbf{H}) \to \mathbf{I}$. Then, for small
$\sigma_N^2$, we get $\hat{\mathbf{Y}} \to \mathbf{Y}$. This indicates that the
RBF kernel easily results in over-fitting of 1HNBKM, which is also confirmed
by the experimental validation in Section 4. (3) Calculating the correlation
matrix in Eq. (20) is more time-saving than calculating the kernel matrix in
Eq. (12). We validate this via a simulation in Matlab: for different $L$, we
compare the computational time of the correlation matrix and the kernel
matrix. From Fig. 1, we can see that the computational time of the kernel
matrix grows exponentially with the increase of $N$.

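StaELM is derived from Gaussian process regression (GPR)
$y = \mathbf{h}(\mathbf{x})\boldsymbol{\beta} + \varepsilon$ with prior
$\boldsymbol{\beta} \sim N(\mathbf{0}, \boldsymbol{\Sigma})$ and
$\varepsilon \sim N(0, \sigma_N^2\mathbf{I})$ [11]. The mean and covariance are

$$E[y] = \mathbf{h}(\mathbf{x})E[\boldsymbol{\beta}] + E[\varepsilon] = 0, \tag{25}$$

$$E[yy'] = \mathbf{h}(\mathbf{x})E[\boldsymbol{\beta}\boldsymbol{\beta}^{T}]\mathbf{h}^{T}(\mathbf{x}') + E[\varepsilon\varepsilon^{T}] = \mathbf{h}(\mathbf{x})\boldsymbol{\Sigma}\mathbf{h}^{T}(\mathbf{x}') + \sigma_N^2. \tag{26}$$

The key of GPR is how to determine the term
$\mathbf{h}(\mathbf{x})\boldsymbol{\Sigma}\mathbf{h}^{T}(\mathbf{x}')$ in
Eq. (26). Because $\boldsymbol{\Sigma}$ is a symmetric positive definite
matrix, it can be decomposed into $\mathbf{A}\mathbf{A}^{T}$, where
$\mathbf{A}$ is a lower triangular matrix. Then, we get

$$\begin{aligned}
\mathbf{h}(\mathbf{x})\boldsymbol{\Sigma}\mathbf{h}^{T}(\mathbf{x}') &= \mathbf{h}(\mathbf{x})\mathbf{A}\mathbf{A}^{T}\mathbf{h}^{T}(\mathbf{x}') = [\mathbf{h}(\mathbf{x})\mathbf{A}][\mathbf{h}(\mathbf{x}')\mathbf{A}]^{T} \\
&= \phi(\mathbf{h}(\mathbf{x}))\,\phi^{T}(\mathbf{h}(\mathbf{x}')) = k(\mathbf{h}(\mathbf{x}),\mathbf{h}(\mathbf{x}')),
\end{aligned} \tag{27}$$

where $k(\mathbf{u},\mathbf{v})$ is a kernel function used to measure the
similarity between $\mathbf{u}$ and $\mathbf{v}$. In StaELM, we replace
$k(\mathbf{u},\mathbf{v})$ with $q(\mathbf{u},\mathbf{v})$ and use correlation
rather than distance to measure this similarity. In fact, the linear kernel
$k(\mathbf{u},\mathbf{v}) = \mathbf{u}\mathbf{v}^{T}$ is used in ELM and BELM.
Hence, ELM, BELM, and 1HNBKM are all special cases of StaELM, which conducts
the prediction based on GPR.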
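
The factorization argument above, writing the prior covariance as a lower
triangular factor times its transpose and reading the result as an inner
product of transformed hidden outputs, can be checked numerically. The
following plain-Python sketch uses a hand-picked 2 x 2 covariance and two
hypothetical hidden outputs. The helper names are assumptions made for this
illustration.

```python
import math

def cholesky(S):
    """Return lower triangular A with A A^T = S for symmetric positive definite S."""
    n = len(S)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(S[i][i] - s)
            else:
                A[i][j] = (S[i][j] - s) / A[j][j]
    return A

def row_times_matrix(h, M):
    """Row vector h times matrix M."""
    return [sum(h[i] * M[i][j] for i in range(len(h))) for j in range(len(M[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

Sigma = [[4.0, 2.0],
         [2.0, 3.0]]                 # hypothetical prior covariance of beta
h_x = [1.0, 2.0]                     # hypothetical h(x)
h_xp = [3.0, -1.0]                   # hypothetical h(x')

A = cholesky(Sigma)                  # A = [[2, 0], [1, sqrt(2)]]
lhs = dot(row_times_matrix(h_x, Sigma), h_xp)                   # h(x) Sigma h(x')^T
rhs = dot(row_times_matrix(h_x, A), row_times_matrix(h_xp, A))  # [h(x) A][h(x') A]^T
# Both sides equal 16 for these numbers (up to floating-point rounding).
```

With the identity matrix as covariance the same computation reduces to the
linear kernel, i.e. the plain inner product of the two hidden outputs.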

Fig. 3. Curves for training and testing accuracies of different ELMs changing
with L on the Treasury dataset: (a) training accuracy; (b) testing accuracy.


4  Experiments 

In this section, we use 12 UCI and KEEL datasets to compare the performances
of ELM, BELM, 1HNBKM, and StaELM, where 6 datasets are for regression problems
and the other 6 for classification problems. Basic descriptions of these
datasets are listed in Tables 1 and 2 respectively. The experimental procedure
and parameter settings are as follows.

- The input variables are normalized to [-1, 1] for the regression datasets
and to [0, 1] for the classification datasets.

- We compare the training/testing accuracies and time of the different
learning algorithms. The accuracies for regression and classification problems
are measured with root mean square error (RMSE) and correct classification
rate (CCR) respectively. The experimental results are the averages of 10 runs
of 10-fold cross-validation.

- In our comparison, the parameters are set to
$(\sigma_N^2, \alpha) = (0.001, 1)$ in BELM,
$(\sigma_N^2, \lambda^2) = (0.001, 1)$ in 1HNBKM, and $\sigma_N^2 = 0.001$ in
StaELM. The input weights and hidden biases are random numbers in [0, 1].
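
The preprocessing and the two accuracy measures in the list above can be
written down in a few lines of plain Python. The text does not say how the
inputs are normalized, so the min-max scaling used here is an assumption, and
the function names are hypothetical.

```python
import math

def minmax_scale(column, lo, hi):
    """Linearly map a list of values onto [lo, hi] (assumed normalization)."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:                 # constant column: map everything to lo
        return [lo for _ in column]
    return [lo + (hi - lo) * (v - cmin) / (cmax - cmin) for v in column]

def rmse(y_true, y_pred):
    """Root mean square error for regression."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def ccr(labels_true, labels_pred):
    """Correct classification rate: fraction of matching labels."""
    hits = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return hits / len(labels_true)

scaled_reg = minmax_scale([2.0, 4.0, 6.0], -1.0, 1.0)   # regression inputs
scaled_cls = minmax_scale([2.0, 4.0, 6.0], 0.0, 1.0)    # classification inputs
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
acc = ccr([0, 1, 1, 0], [0, 1, 0, 0])
```

In a cross-validation setting the scaling bounds would normally be computed
on the training folds only. The text does not state which convention was
used.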

Tables 1 and 2 give the comparative results on the regression and
classification datasets respectively. According to the statistical analysis
with the Wilcoxon signed-ranks test at 95% significance level [7], StaELM
obtains significantly better testing accuracies than the other algorithms.
Meanwhile, StaELM also has better training accuracies than ELM and BELM. In
addition, StaELM is faster than 1HNBKM. For the Concrete Compressive Strength
and Treasury datasets, Figs. 2 and 3 show how the training and testing
accuracies change with the number of hidden layer nodes. From these figures,
we can clearly see that ELM and 1HNBKM have serious over-fitting problems.
With the increase of L, the training RMSEs of ELM and 1HNBKM gradually
decrease. However, their testing RMSEs initially decrease with the increase of
L, pass through a minimum, and then increase. Although the learning curves of
BELM and StaELM both gradually decrease with the increase of L, StaELM
converges faster than BELM. This indicates that StaELM can obtain a lower RMSE
with fewer hidden layer nodes.

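The main reason that 1HNBKM shows obvious over-fitting is that the value of
the RBF kernel in Eq. (14) gradually approaches 0 with the increase of $L$.
This makes the kernel matrix in Eq. (12) approximate an identity matrix.
Assume the hidden layer outputs of instances $\mathbf{u}$ and $\mathbf{v}$ are

$$\mathbf{h}_L(\mathbf{u}) = \begin{bmatrix} g(u_1) & g(u_2) & \cdots & g(u_L) \end{bmatrix}, \tag{28}$$

$$\mathbf{h}_L(\mathbf{v}) = \begin{bmatrix} g(v_1) & g(v_2) & \cdots & g(v_L) \end{bmatrix}, \tag{29}$$

where $u_l = \mathbf{w}_l\cdot\mathbf{u} + b_l$ and
$v_l = \mathbf{w}_l\cdot\mathbf{v} + b_l$, $l = 1, 2, \cdots, L$. When the
number of hidden layer nodes increases from $L$ to $L_1$ ($L_1 > L$), the
hidden layer outputs of instances $\mathbf{u}$ and $\mathbf{v}$ become

$$\mathbf{h}_{L_1}(\mathbf{u}) = \begin{bmatrix} g(u_1) & \cdots & g(u_L) & g(u_{L+1}) & \cdots & g(u_{L_1}) \end{bmatrix}, \tag{30}$$

$$\mathbf{h}_{L_1}(\mathbf{v}) = \begin{bmatrix} g(v_1) & \cdots & g(v_L) & g(v_{L+1}) & \cdots & g(v_{L_1}) \end{bmatrix}. \tag{31}$$

Then, we can calculate

$$k(\mathbf{h}_L(\mathbf{u}),\mathbf{h}_L(\mathbf{v})) = \exp\left(-\frac{\sum_{l=1}^{L}\left[g(u_l)-g(v_l)\right]^2}{2\lambda^2}\right), \tag{32}$$

$$k(\mathbf{h}_{L_1}(\mathbf{u}),\mathbf{h}_{L_1}(\mathbf{v})) = \exp\left(-\frac{\sum_{l=1}^{L}\left[g(u_l)-g(v_l)\right]^2 + \sum_{l_1=L+1}^{L_1}\left[g(u_{l_1})-g(v_{l_1})\right]^2}{2\lambda^2}\right). \tag{33}$$

Because
$\sum_{l=1}^{L}\left[g(u_l)-g(v_l)\right]^2 < \sum_{l=1}^{L}\left[g(u_l)-g(v_l)\right]^2 + \sum_{l_1=L+1}^{L_1}\left[g(u_{l_1})-g(v_{l_1})\right]^2$,

$$k(\mathbf{h}_L(\mathbf{u}),\mathbf{h}_L(\mathbf{v})) > k(\mathbf{h}_{L_1}(\mathbf{u}),\mathbf{h}_{L_1}(\mathbf{v})) \tag{34}$$

can be derived. This indicates that the value of
$k(\mathbf{h}(\mathbf{u}),\mathbf{h}(\mathbf{v}))$ gradually decreases with the
increase of hidden layer nodes. Since
$k(\mathbf{h}(\mathbf{u}),\mathbf{h}(\mathbf{v}))$ is non-negative,
$k(\mathbf{h}(\mathbf{u}),\mathbf{h}(\mathbf{v})) \to 0$ with the increase of
$L$. This leads to $\mathbf{K} \to \mathbf{I}$ and
$\hat{\mathbf{Y}} \to \mathbf{Y}$.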

For the correlation function in Eq. (22), there is no ordering relationship
between $q(\mathbf{h}_L(\mathbf{u}),\mathbf{h}_L(\mathbf{v}))$ and
$q(\mathbf{h}_{L_1}(\mathbf{u}),\mathbf{h}_{L_1}(\mathbf{v}))$, because the
correlation coefficient in Eq. (23) measures the similarity between
$\mathbf{h}(\mathbf{u})$ and $\mathbf{h}(\mathbf{v})$ with the cosine of the
vectorial angle rather than the distance between them. Increasing the vector
dimension does not cause the cosine of the vectorial angle to approach 0.
Hence, $\mathbf{Q} \nrightarrow \mathbf{I}$ with the increase of $L$. This
reduces the chance of over-fitting.
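
The contrast drawn above can be observed in a small plain-Python experiment.
For two fixed inputs, hidden nodes are appended one block at a time and both
the RBF kernel of Eq. (14) and the correlation coefficient of Eq. (23) are
recomputed. The inputs, the seed, the node counts, and the helper names are
assumptions made for this sketch, and the normalization of Eq. (22) is not
applied, so the second column is the raw correlation coefficient.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden(x, W, b):
    """Hidden layer output h(x) for a single input vector x."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bl) for w, bl in zip(W, b)]

def rbf(u, v, lam2=1.0):
    return math.exp(-sum((a - c) ** 2 for a, c in zip(u, v)) / (2.0 * lam2))

def pearson(u, v):
    """Correlation coefficient Cov(u, v) / (sqrt(D(u)) * sqrt(D(v)))."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (c - mean_v) for a, c in zip(u, v)) / n
    var_u = sum((a - mean_u) ** 2 for a in u) / n
    var_v = sum((c - mean_v) ** 2 for c in v) / n
    return cov / math.sqrt(var_u * var_v)

rng = random.Random(1)
L_max = 400
W = [[rng.uniform(0.0, 1.0) for _ in range(3)] for _ in range(L_max)]
b = [rng.uniform(0.0, 1.0) for _ in range(L_max)]
u, v = [0.1, 0.9, 0.4], [0.8, 0.2, 0.7]    # two hypothetical inputs in [0, 1]^3

sizes = [10, 50, 100, 200, 400]
rbf_values, corr_values = [], []
for L in sizes:
    # Use the first L nodes so that larger L only appends new components.
    hu, hv = hidden(u, W[:L], b[:L]), hidden(v, W[:L], b[:L])
    rbf_values.append(rbf(hu, hv))
    corr_values.append(pearson(hu, hv))

for L, k_val, q_val in zip(sizes, rbf_values, corr_values):
    print(f"L={L:4d}  rbf={k_val:.4f}  corr={q_val:.4f}")
```

Because every appended node adds a non-negative term to the squared distance,
the RBF column can only shrink as L grows, whereas the correlation column has
no such built-in trend.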


Table 1. Comparison of different ELMs on 6 REGRESSION datasets

Note: • indicates that the testing accuracy of StaELM is significantly better
than the corresponding algorithm based on Wilcoxon signed-ranks test at 95%
significance level.


Table 2. Comparison of different ELMs on 6 CLASSIFICATION datasets

Note: Only continuous inputs are used for every dataset. For the Phoneme
dataset, 10% of the 5404 instances are randomly selected.

5 Conclusion

In this paper, we have presented a stable ELM (StaELM) that uses the
correlation coefficient to measure the similarity between different hidden
layer outputs. We have further analysed the rationality of StaELM in the
framework of Gaussian process regression. Compared with the kernel-based
methods, StaELM reduces the chance of a singular covariance matrix and makes
ELM more stable to random initialization of input weights and hidden biases.
In addition, our improvement does not cause a significant increase in
computational complexity.

Acknowledgments. This work was supported by the National Natural Science
Foundations of China under Grants 61170040, 71371063, and 61473194.


References 

1.  Cao,  J.,  Lin,  Z.,  Huang,  G.B.:  Self-Adaptive  Evolutionary  Extreme  Learning 
Machine.  Neural  Process.  Lett.  36(3),  285-305  (2012) 

2.  Chatzis,  S.P.,  Korkinof,  D.,  Demiris,  Y.:  The  one-hidden  layer  non-parametric 
bayesian  kernel  machine.  In:  23rd  IEEE  International  Conference  on  Tools  with 
Artificial  Intelligence,  pp.  825-831.  IEEE  Press,  New  York  (2011) 

3.  Huang,  G.B.,  Chen,  L.,  Siew,  C.K.:  Universal  Approximation  Using  Incremental 
Constructive  Feedforward  Networks  with  Random  Hidden  Nodes.  IEEE  Trans. 
Neural  Netw.  17(4),  879-892  (2006) 

4. Huang, G.B., Li, M.B., Chen, L., Siew, C.K.: Incremental Extreme Learning
Machine with Fully Complex Hidden Nodes. Neurocomputing 71(4-6), 576-583
(2008)

5.  Huang,  G.B.,  Wang,  D.H.,  Lan,  Y.:  Extreme  Learning  Machines:  A  Survey.  Int.  J. 
Mach.  Learn.  &  Cybern.  2(2),  107-122  (2011) 

6.  Huang,  G.B.,  Zhu,  Q.Y.,  Siew,  C.K.:  Extreme  Learning  Machine:  Theory  and 
Applications.  Neurocomputing  70(1),  489-501  (2006) 

7. Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets.
J. Mach. Learn. Res. 7, 1-30 (2006)

8.  Luo,  J.H.,  Vong,  C.M.,  Wong,  P.K.:  Sparse  Bayesian  Extreme  Learning  Machine 
for  Multi-Classification.  IEEE  Trans.  Neural  Netw.  Learn.  Syst.  25(4),  836-843 
(2014) 

9.  Matias,  T.,  Souza,  F.,  Araujo,  R.,  Antunes,  C.H.:  Learning  of  A  Single-Hidden 
Layer  Feedforward  Neural  Network  Using  An  Optimized  Extreme  Learning 
Machine.  Neurocomputing  129,  428-436  (2014) 

10.  Miche,  Y.,  Sorjamaa,  A.,  Bas,  P.,  Simula,  O.,  Jutten,  C.,  Lendasse,  A.:  OP-ELM: 
optimally  pruned  extreme  learning  machine.  IEEE  Trans.  Neural  Netw.  21(1), 
158-162  (2010) 

11.  Rasmussen,  C.E.,  Williams,  C.K.I.:  Gaussian  Processes  for  Machine  Learning.  The 
MIT  Press,  Cambridge  (2006) 

12.  Soria-Olivas,  E.,  Gomez-Sanchis,  J.,  Jarman,  I.H.,  Vila-Frances,  J.:  BELM: 
Bayesian  Extreme  Learning  Machine.  IEEE  Trans.  Neural  Netw.  22(3),  505-509 
(2011) 

13.  Wang,  X.Z.,  Shao,  Q.Y.,  Miao,  Q.,  Zhai,  J.H.:  Architecture  Selection  for  Networks 
Trained  with  Extreme  Learning  Machine  Using  Localized  Generalization  Error 
Model.  Neurocomputing  102,  3-9  (2013) 


Use  Correlation  Coefficients  in  Gaussian  Process 




14. Wong, K.I., Vong, C.M., Wong, P.K., Luo, J.H.: Sparse Bayesian extreme
learning machine and its application to biofuel engine performance prediction.
Neurocomputing 149, 397-404 (2015)

15.  Zhu,  Q.Y.,  Qin,  A.,  Suganthan,  P.,  Huang,  G.B.:  Evolutionary  Extreme  Learning 
Machine.  Pattern  Recogn.  38(10),  1759-1763  (2005) 


